Data-driven Schema Normalization
Authors
Abstract
Ensuring Boyce-Codd Normal Form (BCNF) is the most popular way to remove redundancy and anomalies from datasets. Normalization to BCNF forces functional dependencies (FDs) into keys and foreign keys, which eliminates duplicate values and makes data constraints explicit. Despite being well researched in theory, converting the schema of an existing dataset into BCNF is still a complex, manual task, especially because the number of functional dependencies is huge and deriving keys and foreign keys is NP-hard. In this paper, we present a novel normalization algorithm called Normalize, which uses discovered functional dependencies to normalize relational datasets into BCNF. Normalize runs entirely data-driven, which means that redundancy is removed only where it can be observed, and it is (semi-)automatic, which means that a user may or may not interfere with the normalization process. The algorithm introduces an efficient method for calculating the closure over sets of functional dependencies and novel features for choosing appropriate constraints. Our evaluation shows that Normalize can process millions of FDs within a few minutes and that the constraint selection techniques support the construction of meaningful relations during normalization.

1. FUNCTIONAL DEPENDENCIES

A functional dependency (FD) is a statement of the form X → A with X being a set of attributes and A being a single attribute from the same relation R. We say that the left-hand side (Lhs) X functionally determines the right-hand side (Rhs) A. This means that whenever two records in an instance r of R agree on all their X values, they must also agree on their A value [7]. More formally, an FD X → A holds in r iff ∀ t1, t2 ∈ r : t1[X] = t2[X] ⇒ t1[A] = t2[A]. In the following, we consider only non-trivial FDs, which are FDs with A ∉ X. Table 1 depicts an example address dataset for which the two functional dependencies Postcode→City and Postcode→Mayor hold.
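As an illustration of this definition (not part of the paper's algorithm), the FD condition can be checked directly on a small instance. The sketch below represents records as dicts and the example dataset of Table 1 as a list of such records; `fd_holds` is a hypothetical helper name.

```python
def fd_holds(rows, lhs, rhs):
    """Check whether the FD lhs -> rhs holds in an instance:
    any two records agreeing on all lhs values must agree on rhs."""
    seen = {}  # maps lhs value tuple -> observed rhs value
    for t in rows:
        key = tuple(t[a] for a in lhs)
        if key in seen and seen[key] != t[rhs]:
            return False  # two records agree on lhs but differ on rhs
        seen[key] = t[rhs]
    return True

attrs = ["First", "Last", "Postcode", "City", "Mayor"]
address = [dict(zip(attrs, row)) for row in [
    ("Thomas", "Miller", "14482", "Potsdam", "Jakobs"),
    ("Sarah", "Miller", "14482", "Potsdam", "Jakobs"),
    ("Peter", "Smith", "60329", "Frankfurt", "Feldmann"),
    ("Jasmine", "Cone", "01069", "Dresden", "Orosz"),
    ("Mike", "Cone", "14482", "Potsdam", "Jakobs"),
    ("Thomas", "Moore", "60329", "Frankfurt", "Feldmann"),
]]

fd_holds(address, ["Postcode"], "City")   # True
fd_holds(address, ["First"], "Postcode")  # False: two Thomases, two postcodes
```

Note that such a check only validates an FD on the given instance; discovering all holding FDs efficiently is the task of dedicated algorithms such as Tane and HyFD, discussed below.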
©2017, Copyright is with the authors. Published in Proc. 20th International Conference on Extending Database Technology (EDBT), March 21-24, 2017, Venice, Italy: ISBN 978-3-89318-073-8, on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

Table 1: Example address dataset

First    Last    Postcode  City       Mayor
Thomas   Miller  14482     Potsdam    Jakobs
Sarah    Miller  14482     Potsdam    Jakobs
Peter    Smith   60329     Frankfurt  Feldmann
Jasmine  Cone    01069     Dresden    Orosz
Mike     Cone    14482     Potsdam    Jakobs
Thomas   Moore   60329     Frankfurt  Feldmann

Because both FDs have the same Lhs, we can aggregate them to the notation Postcode→City,Mayor. The presence of this FD introduces anomalies in the dataset, because the values Potsdam, Frankfurt, Jakobs, and Feldmann are stored redundantly, and updating these values might cause inconsistencies. If, for instance, some Mr. Schmidt was elected as the new mayor of Potsdam, we must correctly change all three occurrences of Jakobs to Schmidt. Such anomalies can be avoided by normalizing relations into the Boyce-Codd Normal Form (BCNF). A relational schema R is in BCNF iff for all FDs X → A in R the Lhs X is either a key or a superkey [7]. Because Postcode is neither a key nor a superkey in the example dataset, this relation does not meet the BCNF condition. To bring all relations of a schema into BCNF, one has to perform six steps, which are explained in more detail later: (1) discover all FDs, (2) extend the FDs, (3) derive all necessary keys from the extended FDs, (4) identify the BCNF-violating FDs, (5) select a violating FD for decomposition, and (6) split the relation according to the chosen violating FD. Steps (3) to (6) repeat until step (4) finds no more violating FDs and the resulting schema is BCNF-conform.
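Steps (3) and (4) both reduce to attribute-set closures: X is a superkey iff the closure X+ under the extended FDs covers the whole schema. The sketch below is a naive fixpoint computation for illustration only, not the paper's efficient closure method; FDs are assumed to be given as (Lhs frozenset, Rhs attribute) pairs.

```python
def closure(attrs, fds):
    """Naive attribute-set closure X+ : repeatedly apply every FD
    whose Lhs is contained in the current set."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and rhs not in result:
                result.add(rhs)
                changed = True
    return result

def bcnf_violations(schema, fds):
    """Step (4): an FD violates BCNF iff its Lhs is not a superkey,
    i.e. its closure does not cover all attributes of the schema."""
    return [(lhs, rhs) for lhs, rhs in fds
            if closure(lhs, fds) != set(schema)]

schema = ["First", "Last", "Postcode", "City", "Mayor"]
fds = [(frozenset({"First", "Last"}), "Postcode"),
       (frozenset({"Postcode"}), "City"),
       (frozenset({"Postcode"}), "Mayor")]

closure(frozenset({"First", "Last"}), fds)  # covers the whole schema: a key
bcnf_violations(schema, fds)                # the two Postcode FDs violate BCNF
```

Computing one closure this way is quadratic in the number of FDs, which is why doing it over millions of discovered FDs, as Normalize must, calls for the more efficient closure calculation the paper contributes.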
We find several FD discovery algorithms, such as Tane [14] and HyFD [19], that serve step (1), but there are, thus far, no algorithms available to efficiently and automatically solve steps (2) to (6). For the example dataset, an FD discovery algorithm would find twelve valid FDs in step (1). These FDs must be aggregated and transitively extended in step (2) so that we find, inter alia, First,Last→Postcode,City,Mayor and Postcode→City,Mayor. In step (3), the former FD lets us derive the key {First, Last}, because these two attributes functionally determine all other attributes of the relation. Step (4) then determines that the second FD violates the BCNF condition, because its Lhs Postcode is neither a key nor a superkey. If we assume that step (5) is able to automatically select the second FD for decomposition, step (6) decomposes the example relation into R1(First, Last, Postcode) and R2(Postcode, City, Mayor) with {First, Last} and {Postcode} being primary keys and R1.Postcode→R2.Postcode a foreign key constraint. Table 2 shows this result. When again checking for violating FDs, we do not find any and stop the normalization.

Table 2: Normalized example address dataset

R1:
First    Last    Postcode
Thomas   Miller  14482
Sarah    Miller  14482
Peter    Smith   60329
Jasmine  Cone    01069
Mike     Cone    14482
Thomas   Moore   60329

R2:
Postcode  City       Mayor
14482     Potsdam    Jakobs
60329     Frankfurt  Feldmann
01069     Dresden    Orosz

Series ISSN: 2367-2005. DOI: 10.5441/002/edbt.2017.31
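The split in step (6) amounts to two projections with duplicate elimination: R1 keeps everything except the Rhs attributes, while R2 holds the violating Lhs plus its Rhs. A minimal sketch, assuming the same dict-based records as above (`decompose` is a hypothetical helper, not the paper's implementation):

```python
def decompose(rows, schema, lhs, rhs):
    """Split a relation on a BCNF-violating FD lhs -> rhs.
    R1 = projection on schema minus rhs; R2 = projection on lhs + rhs.
    Deduplicating R2 removes the redundantly stored values."""
    r1_attrs = [a for a in schema if a not in rhs]
    r2_attrs = list(lhs) + list(rhs)
    r1 = sorted({tuple(t[a] for a in r1_attrs) for t in rows})
    r2 = sorted({tuple(t[a] for a in r2_attrs) for t in rows})
    return (r1_attrs, r1), (r2_attrs, r2)

attrs = ["First", "Last", "Postcode", "City", "Mayor"]
address = [dict(zip(attrs, row)) for row in [
    ("Thomas", "Miller", "14482", "Potsdam", "Jakobs"),
    ("Sarah", "Miller", "14482", "Potsdam", "Jakobs"),
    ("Peter", "Smith", "60329", "Frankfurt", "Feldmann"),
    ("Jasmine", "Cone", "01069", "Dresden", "Orosz"),
    ("Mike", "Cone", "14482", "Potsdam", "Jakobs"),
    ("Thomas", "Moore", "60329", "Frankfurt", "Feldmann"),
]]

(r1_attrs, r1), (r2_attrs, r2) = decompose(
    address, attrs, ["Postcode"], ["City", "Mayor"])
# r2 now holds only three tuples, one per postcode
```

After the split, each (City, Mayor) pair is stored exactly once in R2, so the mayor of Potsdam would be updated in a single place.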